[AUTOTUNER] Make autotuner take do_bench as a parameter (#4496)
Merged
Conversation
Contributor (Author): This depends on #4392 landing first; otherwise cudagraph benchmarking will not work.
Contributor (Author): Bump -- just rebased.
This makes the autotuner device-agnostic. Instead of having to know about the existence of e.g. do_bench_cudagraph, it can let the callers decide which backend-specific benchmarking function to use. See discussion in triton-lang#4417.
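The idea can be illustrated with a minimal, self-contained sketch of the pattern the PR adopts: instead of hard-coding a benchmarking backend, the autotuner accepts a `do_bench` callable. All names here are illustrative stand-ins, not Triton's actual implementation.

```python
import statistics
import time

def do_bench_wallclock(fn, n_repeat=10):
    """Plain wall-clock benchmark; a stand-in for a backend-specific
    function such as do_bench or do_bench_cudagraph."""
    times = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

class Autotuner:
    """Toy autotuner: ranks configs with whatever benchmark the caller injects."""

    def __init__(self, configs, do_bench):
        self.configs = configs
        self.do_bench = do_bench  # the caller decides how to measure

    def tune(self, kernel):
        timings = {cfg: self.do_bench(lambda: kernel(cfg)) for cfg in self.configs}
        return min(timings, key=timings.get)

def kernel(block_size):
    # Pretend runtime grows with distance from the sweet spot at 64.
    time.sleep(0.002 * abs(block_size - 64) / 64)

tuner = Autotuner(configs=[16, 64, 256], do_bench=do_bench_wallclock)
best = tuner.tune(kernel)
print(best)  # 64
```

With this shape, a CUDA-graph-aware backend can pass its own benchmarking function without the autotuner knowing it exists.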
Contributor (Author): Rebased.
Jokeren reviewed Oct 6, 2024 (three review passes)
Jokeren approved these changes Oct 7, 2024
sfzhu93 pushed a commit to sfzhu93/triton that referenced this pull request (Oct 11, 2024):
> (triton-lang#4496) This makes the autotuner device-agnostic. Instead of having to know about the existence of e.g. do_bench_cudagraph, it can let the callers decide which backend-specific benchmarking function to use. See discussion in triton-lang#4417. Co-authored-by: Keren Zhou <kerenzhou@openai.com>
anmyachev reviewed Oct 15, 2024
Comment on lines 91 to 94:
```python
self.num_warmups = warmup
self.num_reps = rep
import torch
self.use_cuda_graph = use_cuda_graph and torch.cuda.is_available()
```
Contributor: Fields `self.num_warmups`, `self.num_reps`, and `self.use_cuda_graph` are used by PyTorch to find out what parameters the autotuner was called with. Can they be left in place until the corresponding parameters are removed from the `__init__` signature?
Contributor: @int3 is driving the effort. It's up to him. I'm OK either way.
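The compromise the reviewers are discussing — keeping the old fields around while deprecating the old parameters — can be sketched as follows. This is a hypothetical shim, not Triton's actual code; the field names mirror the snippet quoted in the review comment.

```python
import warnings

class Autotuner:
    """Sketch of keeping deprecated constructor parameters alive while
    steering callers toward an injected do_bench (names illustrative)."""

    def __init__(self, configs, do_bench=None, warmup=None, rep=None,
                 use_cuda_graph=False):
        if warmup is not None or rep is not None or use_cuda_graph:
            warnings.warn(
                "warmup, rep, and use_cuda_graph parameters are deprecated; "
                "pass a do_bench callable instead.",
                DeprecationWarning, stacklevel=2)
        # Keep the old fields so external code that introspects them
        # (e.g. PyTorch, per the review comment above) keeps working.
        self.num_warmups = 25 if warmup is None else warmup
        self.num_reps = 100 if rep is None else rep
        self.use_cuda_graph = use_cuda_graph
        self.configs = configs
        self.do_bench = do_bench

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tuner = Autotuner(configs=[], warmup=10)

print(tuner.num_warmups, len(caught))  # 10 1
```

Existing callers keep working and get a `DeprecationWarning` nudging them toward `do_bench` — which matches the warning text visible in the test log further down this thread.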
This was referenced Oct 23, 2024
Luosuu pushed a commit to Luosuu/triton that referenced this pull request (Nov 13, 2024), carrying the same triton-lang#4496 commit message quoted above.
Luosuu pushed a commit to Luosuu/triton that referenced this pull request (Nov 13, 2024):
> (triton-lang#4974) This is a quick follow-up for the recent autotuner/testing changes in triton-lang#4496. This PR moves the empty cache creation into the driver code to make the code more device-independent.
guacamoleo pushed two commits to guacamoleo/triton that referenced this pull request (Nov 14, 2024), carrying the same triton-lang#4496 and triton-lang#4974 commit messages as above.
bertmaher pushed two commits to bertmaher/triton that referenced this pull request (Dec 10, 2024), carrying the same triton-lang#4496 and triton-lang#4974 commit messages as above.
ThomasRaoux pushed a commit that referenced this pull request (Feb 22, 2025):
> (#5992) Related PR: #4496. In the `autotune` decorator, the `do_bench` parameter was omitted when passed to the `Autotuner` constructor, so a caller-supplied `do_bench` was silently replaced by the default. This PR passes `do_bench` through correctly, so callers can use the `do_bench` parameter instead of the deprecated `use_cuda_graph` parameter.
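The class of bug fixed in #5992 — a decorator factory that accepts a parameter but forgets to forward it to the constructor it wraps — is easy to reproduce in miniature. This is a hypothetical sketch, not Triton's real `autotune`; the names are invented for illustration.

```python
class Autotuner:
    def __init__(self, fn, configs, do_bench=None):
        self.fn = fn
        self.configs = configs
        # Fall back to a default only when the caller supplied nothing.
        self.do_bench = do_bench if do_bench is not None else default_bench

def default_bench(fn):
    return 0.0

def my_bench(fn):
    return 1.0

def autotune_buggy(configs, do_bench=None):
    def decorator(fn):
        return Autotuner(fn, configs)  # bug: do_bench never forwarded
    return decorator

def autotune_fixed(configs, do_bench=None):
    def decorator(fn):
        return Autotuner(fn, configs, do_bench=do_bench)  # fix: forwarded
    return decorator

@autotune_buggy(configs=[1], do_bench=my_bench)
def k1():
    pass

@autotune_fixed(configs=[1], do_bench=my_bench)
def k2():
    pass

print(k1.do_bench is my_bench)  # False: caller's choice silently lost
print(k2.do_bench is my_bench)  # True
```

The failure mode is quiet — everything still runs, just with the wrong benchmark — which is presumably why it went unnoticed until a follow-up PR.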
loislo pushed a commit to openxla/triton that referenced this pull request (Mar 4, 2025), three times, each carrying the same triton-lang#5992 commit message as above.
Jokeren pushed a commit that referenced this pull request (Apr 24, 2025):
> Fixes #6150. When running on a CPU host, `triton.autotune()` throws an error:
> ```
> RuntimeError: 0 active drivers ([]). There should only be one.
> ```
> This issue was introduced by #4496, which forces the caller to specify `do_bench`; that may not be easy in a large codebase. Default `do_bench` to `triton.testing.do_bench` when there is no GPU, and add a unit test.
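The fallback described in that fix reduces to choosing a usable default instead of unconditionally asking a possibly absent GPU driver. A minimal sketch, with stand-in benchmark functions rather than Triton's real API:

```python
def do_bench_gpu(fn):
    """Stand-in for a GPU-backed benchmarking function."""
    return "gpu"

def do_bench_fallback(fn):
    """Stand-in for a generic default such as triton.testing.do_bench."""
    return "fallback"

def pick_default_do_bench(has_gpu):
    # Instead of raising "0 active drivers" on a CPU-only host,
    # degrade gracefully to the generic benchmark.
    return do_bench_gpu if has_gpu else do_bench_fallback

bench = pick_default_do_bench(has_gpu=False)
print(bench(lambda: None))  # fallback
```

Callers who need a specific backend can still inject their own `do_bench`; the default only matters when they pass nothing.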
FindHao pushed a commit to FindHao/triton that referenced this pull request (Apr 30, 2025), carrying the same #6150 fix commit message as above.
zhzhcookie added a commit to flagos-ai/FlagGems that referenced this pull request (Jul 1, 2025):
> (#726) Update default parameters of LibTuner to adapt to versions after Triton 3.1. This makes the autotuner device-agnostic. Instead of having to know about the existence of e.g. do_bench_cudagraph, it can let the callers decide which backend-specific benchmarking function to use. See discussion in triton-lang/triton#4496.
liuyunqi20 pushed a commit to flagos-ai/FlagTree that referenced this pull request (Oct 21, 2025), carrying the same triton-lang#4974 follow-up commit message as above.
meta-codesync Bot pushed a commit to facebookexperimental/triton that referenced this pull request (Nov 5, 2025):
Summary:
- Let pytest grab and test everything under the folder directly, for dense output
- Skip the AMD test if not on an AMD GPU
`third_party/tlx/run_all.sh` now skips `third_party/tlx/tutorials/amd-gemm-pipelined.py` on NVIDIA GPUs, as tested locally:
```
% third_party/tlx/run_all.sh
Hello! (Facebook-only)
Need to build triton in this script? {y|n}n
Run all LITs? {y|n}n
Run core Triton python unit tests? {y|n}n
Run all TLX unit tests? {y|n}n
Run TLX tutorial kernels (correctness|performance|no)? {c|p|n}
c
Verifying correctness of TLX tutorial kernels
============================================================================================ test session starts ============================================================================================
platform linux -- Python 3.11.13, pytest-8.3.4, pluggy-1.5.0
rootdir: /data/users/pchen7e4/triton
configfile: pyproject.toml
plugins: xdist-3.7.0, forked-1.6.0, typeguard-4.3.0
collected 17 items
third_party/tlx/tutorials/amd-gemm-pipelined.py s [ 5%]
third_party/tlx/tutorials/blackwell-fa-ws-persistent_test.py . [ 11%]
third_party/tlx/tutorials/blackwell-fa-ws-pipelined-persistent_test.py . [ 17%]
third_party/tlx/tutorials/blackwell-fa-ws-pipelined_test.py . [ 23%]
third_party/tlx/tutorials/blackwell-fa-ws_test.py . [ 29%]
third_party/tlx/tutorials/blackwell-gemm-clc.py . [ 35%]
third_party/tlx/tutorials/blackwell-gemm-pipelined.py . [ 41%]
third_party/tlx/tutorials/blackwell-gemm-ws.py . [ 47%]
third_party/tlx/tutorials/blackwell-grouped-gemm.py . [ 52%]
third_party/tlx/tutorials/hopper-fa-ws-pipelined-pingpong_test.py s [ 58%]
third_party/tlx/tutorials/hopper-fa-ws-pipelined_test.py s [ 64%]
third_party/tlx/tutorials/hopper-fa-ws_test.py s [ 70%]
third_party/tlx/tutorials/hopper-gemm-pipelined_test.py s [ 76%]
third_party/tlx/tutorials/hopper-gemm-ws_test.py s [ 82%]
third_party/tlx/tutorials/hopper-persistent-gemm-ws-cooperative.py s [ 88%]
third_party/tlx/tutorials/hopper-persistent-gemm-ws-pingpong.py s [ 94%]
third_party/tlx/tutorials/vector-add2.py . [100%]
============================================================================================= warnings summary ==============================================================================================
python/triton/runtime/autotuner.py:99
python/triton/runtime/autotuner.py:99
python/triton/runtime/autotuner.py:99
/data/users/pchen7e4/triton/python/triton/runtime/autotuner.py:99: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See triton-lang/triton#4496 for details.
warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
third_party/tlx/tutorials/blackwell-fa-ws-pipelined-persistent_test.py::test_op[triton-fp16-bwd-128-1024-16-8]
/data/users/pchen7e4/miniconda3/lib/python3.11/site-packages/torch/autograd/graph.py:824: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:181.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================= 9 passed, 8 skipped, 4 warnings in 8.85s =====================
```
Pull Request resolved: #635
Reviewed By: htyu
Differential Revision: D86236535
Pulled By: pchen7e2
fbshipit-source-id: d17e708c39172e01351ec599cb927738236fbf87
nicelynice pushed a commit to nicelynice/FlagGems that referenced this pull request (Feb 24, 2026), carrying the same flagos-ai#726 commit message as above.
This was referenced Mar 30, 2026